Choose one member of your group and listen together to some of the songs on their Discover Weekly playlist. Try to make a prediction about the centre of their taste cluster with respect to Spotify API features (e.g., high/low danceability, high/low energy, high/low liveness, etc.).
Suppose you want to use the results of this breakout session to make (or refute) the claim that Spotify's Discover Weekly algorithm can identify users' musical taste. Would this "experiment" have statistical validity? External validity? Internal validity? Construct validity? Why or why not?
This week we will be using the still-developing tidymodels framework to integrate with the different machine-learning libraries in a consistent manner. You can install this package from the usual RStudio Tools menu. All of the other tools that are strictly necessary for clustering are available in base R. For full flexibility, however, the ggdendro, protoclust, and heatmaply packages are recommended. If you want to explore further possibilities, look at the cluster package.
As you work through the breakout sessions, you will occasionally get error messages asking you to install other packages, too. Install whatever R asks for from the Tools menu, and then try running the chunk again. Two helper functions are also included here: get_conf_mat() and get_pr().
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✔ ggplot2 3.3.3     ✔ purrr   0.3.4
## ✔ tibble  3.1.0     ✔ dplyr   1.0.5
## ✔ tidyr   1.1.3     ✔ stringr 1.4.0
## ✔ readr   1.4.0     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 0.1.2 ──
## ✔ broom     0.7.5      ✔ recipes   0.1.15
## ✔ dials     0.0.9      ✔ rsample   0.0.9
## ✔ infer     0.5.4      ✔ tune      0.1.3
## ✔ modeldata 0.1.0      ✔ workflows 0.2.1
## ✔ parsnip   0.1.5      ✔ yardstick 0.0.7
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## x scales::discard() masks purrr::discard()
## x dplyr::filter() masks stats::filter()
## x recipes::fixed() masks stringr::fixed()
## x dplyr::lag() masks stats::lag()
## x yardstick::spec() masks readr::spec()
## x recipes::step() masks stats::step()
library(ggdendro)
library(heatmaply)
## Loading required package: plotly
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
## Loading required package: viridis
## Loading required package: viridisLite
##
## Attaching package: 'viridis'
## The following object is masked from 'package:scales':
##
## viridis_pal
##
## ======================
## Welcome to heatmaply version 1.2.1
##
## Type citation('heatmaply') for how to cite the package.
## Type ?heatmaply for the main documentation.
##
## The github page is: https://github.com/talgalili/heatmaply/
## Please submit your suggestions and bug-reports at: https://github.com/talgalili/heatmaply/issues
## Or contact: <tal.galili@gmail.com>
## ======================
library(spotifyr)
##
## Attaching package: 'spotifyr'
## The following object is masked from 'package:yardstick':
##
## tidy
## The following object is masked from 'package:rsample':
##
## tidy
## The following object is masked from 'package:recipes':
##
## tidy
## The following object is masked from 'package:parsnip':
##
## tidy
## The following object is masked from 'package:broom':
##
## tidy
library(compmus)
get_conf_mat <- function(fit) {
outcome <- .get_tune_outcome_names(fit)
fit %>%
collect_predictions() %>%
conf_mat(truth = outcome, estimate = .pred_class)
}
get_pr <- function(fit) {
fit %>%
conf_mat_resampled() %>%
group_by(Prediction) %>% mutate(precision = Freq / sum(Freq)) %>%
group_by(Truth) %>% mutate(recall = Freq / sum(Freq)) %>%
ungroup() %>% filter(Prediction == Truth) %>%
select(class = Prediction, precision, recall)
}
In order to demonstrate some of the principles of classification, we will try to identify some of the features that Spotify uses to designate playlists as "workout" playlists. For a full analysis, we would need to delve deeper, but let's start with a comparison of three playlists: Indie Pop, Indie Party, and Indie Workout. For speed, this example will work with only the first 20 songs from each playlist, but you should feel free to use more if your computer can handle it.
After you have this section of the notebook working, try using other combinations of Spotify workout playlists with similarly-named non-workout playlists.
pop <-
get_playlist_audio_features("spotify", "37i9dQZF1DWWEcRhUVtL8n")
party <- get_playlist_audio_features("spotify", "37i9dQZF1DWTujiC7wfofZ")
workout <- get_playlist_audio_features("spotify", "37i9dQZF1DXaRL7xbcDl7X")
indie <-
bind_rows(
pop %>% mutate(playlist = "Indie Pop") %>% slice_head(n = 20),
party %>% mutate(playlist = "Indie Party") %>% slice_head(n = 20),
workout %>% mutate(playlist = "Indie Workout") %>% slice_head(n = 20)
)
As you think about this lab session, and your portfolio, think about the four kinds of validity that Sturm and Wiggins discussed in our reading for last week: do these projects have statistical, external, internal, and construct validity?
We bind the three playlists together using the trick from the beginning of the course, transpose the chroma vectors to a common tonic using the compmus_c_transpose function, and then summarise the vectors like we did when generating chromagrams and cepstrograms. Again, Aitchison's clr transformation can help with chroma.
indie_features <-
indie %>% # For your portfolio, change this to the name of your corpus.
add_audio_analysis() %>%
mutate(
playlist = factor(playlist),
segments = map2(segments, key, compmus_c_transpose),
pitches =
map(
segments,
compmus_summarise, pitches,
method = "mean", norm = "manhattan"
),
timbre =
map(
segments,
compmus_summarise, timbre,
method = "mean",
)
) %>%
mutate(pitches = map(pitches, compmus_normalise, "clr")) %>%
mutate_at(vars(pitches, timbre), map, bind_rows) %>%
unnest(cols = c(pitches, timbre))
In the tidyverse approach, we can preprocess data with a recipe specifying what we are predicting and what variables we think might be useful for that prediction. Then we use step functions to do any data cleaning (usually centering and scaling, but step_range is a viable alternative that squeezes everything to be between 0 and 1).
indie_recipe <-
recipe(
playlist ~
danceability +
energy +
loudness +
speechiness +
acousticness +
instrumentalness +
liveness +
valence +
tempo +
duration +
C + `C#|Db` + D + `D#|Eb` +
E + `F` + `F#|Gb` + G +
`G#|Ab` + A + `A#|Bb` + B +
c01 + c02 + c03 + c04 + c05 + c06 +
c07 + c08 + c09 + c10 + c11 + c12,
data = indie_features, # Use the same name as the previous block.
) %>%
step_center(all_predictors()) %>%
step_scale(all_predictors()) # Converts to z-scores.
# step_range(all_predictors()) # Sets range to [0, 1].
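To make concrete what these preprocessing steps compute, here is a minimal base-R sketch on a toy feature vector (the numbers are made up for illustration and are not part of the recipe):

```r
# Toy feature vector standing in for, e.g., tempo values (assumed data).
x <- c(2, 4, 6, 8, 10)

# step_center + step_scale: convert to z-scores (mean 0, sd 1).
z <- (x - mean(x)) / sd(x)

# step_range: squeeze everything into [0, 1].
r <- (x - min(x)) / (max(x) - min(x))

round(z, 2)  # -1.26 -0.63  0.00  0.63  1.26
r            #  0.00  0.25  0.50  0.75  1.00
```

The recipe applies one of these transformations to every predictor column, so that no feature dominates simply because of its units (e.g., tempo in BPM versus loudness in dB).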
The vfold_cv function sets up cross-validation. We will use 5-fold cross-validation here in the interest of speed, but 10-fold cross-validation is more typical.
indie_cv <- indie_features %>% vfold_cv(5)
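Under the hood, 5-fold cross-validation partitions the corpus into five folds of equal size; each fold takes one turn as the assessment set while the model is fitted on the other four. A hedged base-R sketch of the idea (the fold assignment below is purely illustrative, not what vfold_cv actually stores):

```r
set.seed(2021)  # assumed seed, only to make this sketch reproducible
n <- 60         # 3 playlists x 20 tracks
folds <- sample(rep(1:5, length.out = n))
table(folds)    # each of the 5 folds holds out 12 of the 60 tracks
# In turn, each fold serves as the assessment set while the
# remaining 48 tracks are used for fitting.
```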
Your DataCamp tutorials this week introduced four classical algorithms for classification: \(k\)-nearest neighbour, naive Bayes, logistic regression, and decision trees. Other than naive Bayes, all of them can be implemented more simply in tidymodels.
A \(k\)-nearest neighbour classifier often works just fine with only one neighbour. It is very sensitive to the choice of features, however. Let's check its performance as a baseline.
knn_model <-
nearest_neighbor(neighbors = 1) %>%
set_mode("classification") %>%
set_engine("kknn")
indie_knn <-
workflow() %>%
add_recipe(indie_recipe) %>%
add_model(knn_model) %>%
fit_resamples(
indie_cv,
control = control_resamples(save_pred = TRUE)
)
indie_knn %>% get_conf_mat()
## Truth
## Prediction Indie Party Indie Pop Indie Workout
## Indie Party 8 5 4
## Indie Pop 3 9 9
## Indie Workout 9 6 7
These confusion matrices can be autoplotted in two forms.
indie_knn %>% get_conf_mat() %>% autoplot(type = "mosaic")
indie_knn %>% get_conf_mat() %>% autoplot(type = "heatmap")
We can also compute precision and recall for each class.
indie_knn %>% get_pr()
## # A tibble: 3 x 3
## class precision recall
## <fct> <dbl> <dbl>
## 1 Indie Party 0.471 0.4
## 2 Indie Pop 0.429 0.45
## 3 Indie Workout 0.318 0.35
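To make the definitions concrete, the precision and recall above can be recomputed by hand from the k-NN confusion matrix (note that get_pr averages over resamples via conf_mat_resampled, so its numbers can differ slightly from a raw recomputation; here they happen to agree):

```r
# Confusion matrix from the k-NN output above:
# rows are predictions, columns are truth.
conf <- matrix(
  c(8, 3, 9,   # Truth: Indie Party
    5, 9, 6,   # Truth: Indie Pop
    4, 9, 7),  # Truth: Indie Workout
  nrow = 3,
  dimnames = list(
    Prediction = c("Indie Party", "Indie Pop", "Indie Workout"),
    Truth      = c("Indie Party", "Indie Pop", "Indie Workout")
  )
)

# Precision: correct predictions over everything *predicted* as that class.
precision <- diag(conf) / rowSums(conf)
# Recall: correct predictions over everything *truly* in that class.
recall <- diag(conf) / colSums(conf)

round(precision, 3)  # 0.471 0.429 0.318
round(recall, 3)     # 0.400 0.450 0.350
```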
Decision trees are nicely intuitive and perform somewhat better here.
tree_model <-
decision_tree() %>%
set_mode("classification") %>%
set_engine("C5.0")
indie_tree <-
workflow() %>%
add_recipe(indie_recipe) %>%
add_model(tree_model) %>%
fit_resamples(
indie_cv,
control = control_resamples(save_pred = TRUE)
)
indie_tree %>% get_pr()
## # A tibble: 3 x 3
## class precision recall
## <fct> <dbl> <dbl>
## 1 Indie Party 0.429 0.3
## 2 Indie Pop 0.5 0.6
## 3 Indie Workout 0.318 0.35
We can look at the whole tree with the summary command. Be careful not to read too much into the actual numerical values, however: remember that the features were standardised before we started classification. Without cross-validation, the algorithm looks much better from the summary than it actually was in practice, but we can still see that timbre features are important and chroma features probably arenโt.
workflow() %>%
add_recipe(indie_recipe) %>%
add_model(tree_model) %>%
fit(indie_features) %>%
pluck("fit", "fit", "fit") %>%
summary()
##
## Call:
## C5.0.default(x = x, y = y, trials = 1, control = C50::C5.0Control(minCases =
## 2, sample = 0))
##
##
## C5.0 [Release 2.07 GPL Edition] Wed Mar 17 10:55:00 2021
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 60 cases (35 attributes) from undefined.data
##
## Decision tree:
##
## A#\|Bb <= -1.259173: Indie Pop (6)
## A#\|Bb > -1.259173:
## :...duration <= -0.6008952:
## :...c03 <= -0.5632606: Indie Pop (4)
## : c03 > -0.5632606: Indie Party (10/1)
## duration > -0.6008952:
## :...energy <= 0.1583392:
## :...liveness > 0.8180664: Indie Party (2)
## : liveness <= 0.8180664:
## : :...c07 <= -1.007623: Indie Pop (5)
## : c07 > -1.007623:
## : :...c10 <= -1.069226: Indie Pop (3)
## : c10 > -1.069226: Indie Workout (11/2)
## energy > 0.1583392:
## :...c09 <= -2.337745: Indie Party (2)
## c09 > -2.337745:
## :...c10 <= 0.3039758: Indie Workout (8)
## c10 > 0.3039758:
## :...danceability <= -0.3527262: Indie Party (5)
## danceability > -0.3527262: Indie Workout (4/1)
##
##
## Evaluation on training data (60 cases):
##
## Decision Tree
## ----------------
## Size Errors
##
## 11 4( 6.7%) <<
##
##
## (a) (b) (c) <-classified as
## ---- ---- ----
## 18 2 (a): class Indie Party
## 1 18 1 (b): class Indie Pop
## 20 (c): class Indie Workout
##
##
## Attribute usage:
##
## 100.00% A#\|Bb
## 90.00% duration
## 66.67% energy
## 51.67% c10
## 35.00% liveness
## 31.67% c07
## 31.67% c09
## 23.33% c03
## 15.00% danceability
##
##
## Time: 0.0 secs
Random forests, ensembles of many decision trees each fitted to a random subset of the data and features, usually outperform a single decision tree.
forest_model <-
rand_forest() %>%
set_mode("classification") %>%
set_engine("ranger", importance = "impurity")
indie_forest <-
workflow() %>%
add_recipe(indie_recipe) %>%
add_model(forest_model) %>%
fit_resamples(
indie_cv,
control = control_resamples(save_pred = TRUE)
)
indie_forest %>% get_pr()
## # A tibble: 3 x 3
## class precision recall
## <fct> <dbl> <dbl>
## 1 Indie Party 0.409 0.45
## 2 Indie Pop 0.6 0.6
## 3 Indie Workout 0.278 0.25
Random forests give us the best-quality ranking of feature importance, which we can extract with ranger::importance and plot with ggplot2. Again, it is clear that timbre, specifically Component 1 (power) and Component 11, is important. Note that because random forests are indeed random, the accuracy and feature rankings will vary (slightly) every time you re-run the code.
workflow() %>%
add_recipe(indie_recipe) %>%
add_model(forest_model) %>%
fit(indie_features) %>%
pluck("fit", "fit", "fit") %>%
ranger::importance() %>%
enframe() %>%
mutate(name = fct_reorder(name, value)) %>%
ggplot(aes(name, value)) +
geom_col() +
coord_flip() +
theme_minimal() +
labs(x = NULL, y = "Importance")
Armed with this feature set, perhaps we can make a better plot. It's clear that the party playlist is somewhat higher in energy and higher on Components 1 and 2 than the pop playlist, but the workout playlist only seems to differ on Component 2.
indie_features %>%
ggplot(aes(x = c01, y = c02, colour = playlist, size = energy)) +
geom_point(alpha = 0.8) +
scale_color_viridis_d() +
labs(
x = "Timbre Component 1",
y = "Timbre Component 2",
size = "Energy",
colour = "Playlist"
)
Can you get a clearer visualisation by using a different set of the top features from the random forest?
The Bibliothèque nationale de France (BnF) makes a large portion of its music collection available on Spotify, including an eclectic collection of curated playlists. The defining musical characteristics of these playlists are sometimes unclear: for example, they have a Halloween playlist. Perhaps clustering can help us organise and describe what kinds of musical selections make it into the BnF's Halloween playlist.
We begin by loading the playlist and summarising the pitch and timbre features, just like last week. Note that, also like last week, we use compmus_c_transpose to transpose the chroma features so that, depending on the accuracy of Spotify's key estimation, we can interpret them as if every piece were in C major or C minor.
halloween <-
get_playlist_audio_features("bnfcollection", "1vsoLSK3ArkpaIHmUaF02C") %>%
add_audio_analysis() %>%
mutate(
segments = map2(segments, key, compmus_c_transpose),
pitches =
map(segments,
compmus_summarise, pitches,
method = "mean", norm = "manhattan"
),
timbre =
map(
segments,
compmus_summarise, timbre,
method = "mean"
)
) %>%
mutate(pitches = map(pitches, compmus_normalise, "clr")) %>%
mutate_at(vars(pitches, timbre), map, bind_rows) %>%
unnest(cols = c(pitches, timbre))
Remember that in the tidyverse approach, we can preprocess data with a recipe. In this case, instead of a label that we want to predict, we start with a label that will make the cluster plots readable. For most projects, the track name will be the best choice (although feel free to experiment with others). The code below uses str_trunc to clip the track name to a maximum of 20 characters, again in order to improve readability. The other change from last week is column_to_rownames, which is necessary for the plot labels to appear correctly.
Last week we also discussed that although standardising variables with step_center (making the mean 0) and step_scale (making the standard deviation 1) is the most common approach, step_range, which squashes or stretches each feature so that it ranges from 0 to 1, is sometimes a better alternative. For most classification algorithms, the difference is small; for clustering, the differences can be more noticeable. It's wise to try both.
halloween_juice <-
recipe(
track.name ~
danceability +
energy +
loudness +
speechiness +
acousticness +
instrumentalness +
liveness +
valence +
tempo +
duration +
C + `C#|Db` + D + `D#|Eb` +
E + `F` + `F#|Gb` + G +
`G#|Ab` + A + `A#|Bb` + B +
c01 + c02 + c03 + c04 + c05 + c06 +
c07 + c08 + c09 + c10 + c11 + c12,
data = halloween
) %>%
step_center(all_predictors()) %>%
step_scale(all_predictors()) %>%
# step_range(all_predictors()) %>%
prep(halloween %>% mutate(track.name = str_trunc(track.name, 20))) %>%
juice() %>%
column_to_rownames("track.name")
When using step_center and step_scale, Euclidean distance is the usual choice. When using step_range, Manhattan distance is also a good option: that combination is known as Gower's distance and has a long history in clustering.
After you have this section of the notebook working with Euclidean distance, try modifying it to use Gowerโs distance.
halloween_dist <- dist(halloween_juice, method = "euclidean")
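As a toy sketch of the Gower-style alternative, with two made-up features on very different scales: range-scale each feature to [0, 1] and then take Manhattan distances.

```r
# Assumed data: three tracks with, say, a danceability-like feature
# and a tempo-like feature on incompatible scales.
m <- rbind(a = c(0.2, 120),
           b = c(0.8, 180),
           c = c(0.5,  90))

# What step_range would do, column by column.
m_range <- apply(m, 2, function(x) (x - min(x)) / (max(x) - min(x)))

# Manhattan distance on range-scaled features: Gower's distance.
dist(m_range, method = "manhattan")

# With the Halloween corpus, the equivalent would be to switch the
# recipe to step_range and then call:
# halloween_dist <- dist(halloween_juice, method = "manhattan")
```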
There are three primary types of linkage: single, average, and complete. Usually average or complete linkage gives the best results. We can use the ggdendrogram function to make a more standardised plot of the results.
halloween_dist %>%
hclust(method = "single") %>% # Try single, average, and complete.
dendro_data() %>%
ggdendrogram()
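To see how the linkage choice changes the tree, here is a toy sketch with four one-dimensional points (assumed data): single linkage merges clusters by their closest members, complete linkage by their farthest, and average linkage takes the mean of all between-cluster distances, so the heights at which clusters merge differ.

```r
# Four points on a line: w and x are close, y is nearby, z is far away.
pts <- matrix(c(0, 1, 4, 10),
              dimnames = list(c("w", "x", "y", "z"), NULL))
d <- dist(pts)

# The height of the final merge under each linkage:
sapply(c("single", "average", "complete"),
       function(m) max(hclust(d, method = m)$height))
# single: 6, average: 8.33, complete: 10
```

The cluster memberships are the same here, but the merge heights (and hence the shape of the dendrogram) are not, which is why the three linkages can look quite different on the Halloween corpus.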
Try all three of these linkages. Which one looks the best? Which one sounds the best (when you listen to the tracks on Spotify)? Can you guess which features are separating the clusters?
Especially for storyboards, it can be helpful to visualise hierarchical clusterings along with heatmaps of feature values. We can do that with heatmaply. Although the interactive heatmaps are flashy, think carefully when deciding whether this representation is more helpful for your storyboard than the simpler dendrograms above.
heatmaply(
halloween_juice,
hclustfun = hclust,
hclust_method = "average", # Change for single, average, or complete linkage.
dist_method = "euclidean"
)
Can you determine from the heatmap which features seem to be the most and least useful for the clustering?